{"id":45093,"date":"2023-10-20T15:41:05","date_gmt":"2023-10-20T15:41:05","guid":{"rendered":"http:\/\/startupsmart.test\/2023\/10\/20\/could-this-tool-for-the-dark-web-fight-human-trafficking-and-worse-startupsmart\/"},"modified":"2023-10-20T15:41:05","modified_gmt":"2023-10-20T15:41:05","slug":"could-this-tool-for-the-dark-web-fight-human-trafficking-and-worse-startupsmart","status":"publish","type":"post","link":"https:\/\/www.startupsmart.com.au\/uncategorized\/could-this-tool-for-the-dark-web-fight-human-trafficking-and-worse-startupsmart\/","title":{"rendered":"Could this tool for the dark web fight human trafficking and\u00a0worse? &#8211; StartupSmart"},"content":{"rendered":"<div class=\"td-post-featured-image\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"332\" itemprop=\"image\" class=\"entry-thumb\" src=\"\/wp-content\/uploads\/2017\/01\/Bullet-crime-dark-grim-violence-640x332.jpg\" alt=\"Bullet\" title=\"bullet-crime-dark-grim-violence\"><\/div>\n<p>In today\u2019s data-rich world, companies, governments and individuals want to analyse anything and everything they can get their hands on \u2013 and the World Wide Web has loads of information.<\/p>\n<p>At present, the most easily indexed material from the web is text.<\/p>\n<p>But as much as 89 to 96 percent of the content on the internet is actually something else \u2013 images, video, audio, in all thousands of different kinds of non-textual data types.<\/p>\n<p>Further, the vast majority of online content isn\u2019t available in a form that\u2019s easily indexed by electronic archiving systems like Google\u2019s.<\/p>\n<p>Rather, it requires a user to log in, or it is provided dynamically by a program running when a user visits the page.<\/p>\n<p>If we\u2019re going to catalog online human knowledge, we need to be sure we can get to and recognise all of it, and that we can do so automatically.<\/p>\n<p>How can we teach computers to recognise, index and search all the different types of material that\u2019s available online?<\/p>\n<p>Thanks to federal efforts in the global fight against human trafficking and weapons dealing, my research forms the basis for a new tool that can help with this effort.<\/p>\n<h3>Understanding what\u2019s deep<\/h3>\n<p>The \u201cdeep web\u201d and the \u201cdark web\u201d are often discussed in the context of scary news or films like \u201cDeep Web,\u201d in which young and intelligent criminals are getting away with illicit activities such as drug dealing and human trafficking \u2013 or even worse.<\/p>\n<p>But what do these terms mean?<\/p>\n<p>The \u201cdeep web\u201d has existed ever since businesses and organisations, including universities, put large databases online in ways people could not directly view.<\/p>\n<p>Rather than allowing anyone to get students\u2019 phone numbers and email addresses, for example, many universities require people to log in as members of the campus community before searching online directories for contact information.<\/p>\n<p>Online services such as Dropbox and Gmail are publicly accessible and part of the World Wide Web \u2013 but indexing a user\u2019s files and emails on these sites does require an individual login, which our project does not get involved with.<\/p>\n<p>The \u201csurface web\u201d is the online world we can see \u2013 shopping sites, businesses\u2019 information pages, news organisations and so on.<\/p>\n<p>The \u201cdeep web\u201d is closely related, but less visible, to human users and \u2013 in some ways more importantly \u2013 to search engines exploring the web to catalog it.<\/p>\n<p>I tend to describe the \u201cdeep web\u201d as those parts of the public internet that:<\/p>\n<ol>\n<li>Require a user to first fill out a login form,<\/li>\n<li>Involve dynamic content like AJAX or Javascript, or<\/li>\n<li>Present images, video and other information in ways that aren\u2019t typically indexed properly by search services.<\/li>\n<\/ol>\n<h3>What\u2019s dark?<\/h3>\n<p>The \u201cdark web,\u201d by contrast, are pages \u2013 some of which may also have \u201cdeep web\u201d elements \u2013 that are hosted by web servers using the anonymous web protocol called Tor.<\/p>\n<p>Originally developed by US Defence Department researchers to secure sensitive information, Tor was released into the public domain in 2004.<\/p>\n<p>Like many secure systems such as the WhatsApp messaging app, its original purpose was for good, but has also been used by criminals hiding behind the system\u2019s anonymity.<\/p>\n<p>Some people run Tor sites handling illicit activity, such as drug trafficking, weapons and human trafficking and even murder for hire.<\/p>\n<p>The US&nbsp;government has been interested in trying to find ways to use modern information technology and computer science to combat these criminal activities.<\/p>\n<p>In 2014, the Defence Advanced Research Projects Agency (more commonly known as DARPA), a part of the Defence Department, launched a program called Memex to fight human trafficking with these tools.<\/p>\n<p>Specifically, Memex wanted to create a search index that would help law enforcement identify human trafficking operations online \u2013 in particular by mining the deep and dark web.<\/p>\n<p>One of the key systems used by the project\u2019s teams of scholars, government workers and industry experts was one I helped develop, called Apache Tika.<\/p>\n<h3>The \u2018digital Babel fish\u2019<\/h3>\n<p>Tika is often referred to as the \u201cdigital Babel fish,\u201d a play on a creature called the \u201cBabel fish\u201d in the \u201cHitchhiker\u2019s Guide to the Galaxy\u201d book series.<\/p>\n<p>Once inserted into a person\u2019s ear, the Babel fish allowed her to understand any language spoken.<\/p>\n<p>Tika lets users understand any file and the information contained within it.<\/p>\n<p>When Tika examines a file, it automatically identifies what kind of file it is \u2013 such as a photo, video or audio.<\/p>\n<p>It does this with a curated taxonomy of information about files: their name, their extension, a sort of \u201cdigital fingerprint&#8221;.<\/p>\n<p>When it encounters a file whose name ends in &#8220;.MP4&#8221; for example, Tika assumes it\u2019s a video file stored in the MPEG-4 format.<\/p>\n<p>By directly analysing the data in the file, Tika can confirm or refute that assumption \u2013 all video, audio, image and other files must begin with specific codes saying what format their data is stored in.<\/p>\n<p>Once a file\u2019s type is identified, Tika uses specific tools to extract its content such as Apache PDFBox for PDF files, or Tesseract for capturing text from images.<\/p>\n<p>In addition to content, other forensic information or &#8220;metadata\u201d is captured including the file\u2019s creation date, who edited it last, and what language the file is authored in.<\/p>\n<p>From there, Tika uses advanced techniques like Named Entity Recognition (NER) to further analyse the text.<\/p>\n<p>NER identifies proper nouns and sentence structure, and then fits this information to databases of people, places and things, identifying not just whom the text is talking about, but where, and why they are doing it.<\/p>\n<p>This technique helped Tika to automatically identify offshore shell corporations (the things); where they were located; and who (people) was storing their money in them as part of the Panama Papers scandal that exposed financial corruption among global political, societal and technical leaders.<\/p>\n<figure style=\"width: 754px\" class=\"wp-caption alignnone\"><figcaption class=\"wp-caption-text\">                             Tika extracting information from images of weapons curated from the deep and dark web. Stolen weapons are classified automatically for further follow-up.                           <\/figcaption><\/figure>\n<h3 class=\"align-center zoomable\">Identifying illegal activity<\/h3>\n<p>Improvements to Tika during the Memex project made it even better at handling multimedia and other content found on the deep and dark web.<\/p>\n<p>Now Tika can process and identify images with common human trafficking themes.<\/p>\n<p>For example, it can automatically process and analyse text in images \u2013 a victim alias or an indication about how to contact them \u2013 and certain types of image properties \u2013 such as camera lighting.<\/p>\n<p>In some images and videos, Tika can identify the people, places and things that appear.<\/p>\n<p>Additional software can help Tika find automatic weapons and identify a weapon\u2019s serial number.<\/p>\n<p>That can help to track down whether it is stolen or not.<\/p>\n<p>Employing Tika to monitor the deep and dark web continuously could help identify human- and weapons-trafficking situations shortly after the photos are posted online.<\/p>\n<p>That could stop a crime from occurring and save lives.<\/p>\n<p>Memex is not yet powerful enough to handle all of the content that\u2019s out there, nor to comprehensively assist law enforcement, contribute to humanitarian efforts to stop human trafficking and even interact with commercial search engines.<\/p>\n<p>It will take more work, but we\u2019re making it easier to achieve those goals.<\/p>\n<p>Tika and related software packages are part of an open source software library available on DARPA\u2019s Open Catalog to anyone \u2013 in law enforcement, the intelligence community or the public at large \u2013 who wants to shine a light into the deep and the dark.<\/p>\n<p><em>Christian Mattmann is a principal data scientist and the chief architect in the Instrument and Data Systems section at the Jet Propulsion Laboratory in California.&nbsp;<\/em><\/p>\n<p><em>This article was originally published on The Conversation.&nbsp;<\/em><\/p>\n<p><em>Follow StartupSmart on&nbsp;Facebook,&nbsp;Twitter,&nbsp;LinkedIn&nbsp;and iTunes.&nbsp;<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In today\u2019s data-rich world, companies, governments and individuals want to analyse anything and everything they can get their hands on<\/p>\n","protected":false},"author":1,"featured_media":58925,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10,20,1],"tags":[],"_links":{"self":[{"href":"https:\/\/www.startupsmart.com.au\/wp-json\/wp\/v2\/posts\/45093"}],"collection":[{"href":"https:\/\/www.startupsmart.com.au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.startupsmart.com.au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.startupsmart.com.au\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.startupsmart.com.au\/wp-json\/wp\/v2\/comments?post=45093"}],"version-history":[{"count":0,"href":"https:\/\/www.startupsmart.com.au\/wp-json\/wp\/v2\/posts\/45093\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.startupsmart.com.au\/wp-json\/wp\/v2\/media\/58925"}],"wp:attachment":[{"href":"https:\/\/www.startupsmart.com.au\/wp-json\/wp\/v2\/media?parent=45093"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.startupsmart.com.au\/wp-json\/wp\/v2\/categories?post=45093"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.startupsmart.com.au\/wp-json\/wp\/v2\/tags?post=45093"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}