This interview was conducted on Saturday, July 13, 2019 (edited for brevity and language jointly by the interviewer and interviewee).
Jochen Leidner: Today, I have on the line Shadi Saleh, co-founder of the first-ever Web search engine in Syria. I met Shadi at ECIR 2019 [the 41st European Conference on Information Retrieval] in Cologne, and when I heard his story, I asked him whether he would share his entrepreneurial journey with us. Good afternoon, Shadi. Are you ready? Welcome, and thanks for agreeing to be interviewed; would you share some details about yourself to get started, please?
Shadi Saleh: Yes, I’m ready, thanks. I’m Shadi Saleh, 31 years old, from Latakia, Syria. I finished my software engineering degree in Syria at Tishreen University before I moved to the Czech Republic.
Jochen Leidner: You started a project that became Syria’s first Web search engine. Tell us a little about the background and context to the story, please. How did it all start?
Shadi Saleh: Actually, while I was studying software engineering I was very active in competitive programming, notably in the International Olympiad in Informatics [IOI] and the ICPC, the International Collegiate Programming Contest. In 2012, when I was attending the International Olympiad in Italy, I met a member of the scientific committee of the IOI. He was a professor from Charles University in Prague. He happened to share some data during our lunch break, when I talked to him about myself and my interests in Natural Language Processing, especially in Arabic NLP. He mentioned the Institute of Formal and Applied Linguistics at Charles University, and that they had a project concerning the Arabic language, the Prague Arabic Dependency Treebank. Actually, that was the first time I thought about Prague. So, once I returned to Syria, I did some research, and liked what I found about the institute and Charles University, and I liked what I read about Prague. So I applied for a visa, they gave it to me, and after a few months I moved to Prague. My first half year in Prague was spent on a language course. Then I started my Ph.D. in cross-lingual information retrieval at the Institute of Formal and Applied Linguistics, and here I am.
Jochen Leidner: Fantastic. Obviously in science, you can often meet interesting people at conferences around the world. That’s also how you and I met. Now I’m kind of curious, when and how did you first get this idea of building your own Web search engine?
Shadi Saleh: Actually, from the very beginning, when I moved to Prague. I remember I went to buy a SIM card, and I asked the guy in the shop to make sure that the Internet was working on my mobile phone. I noticed he didn’t do what we always do, i.e. type “google.com”. He went to seznam.cz, the Czech Web search engine.
I thought, “People here really like Seznam, their own engine; this is the first option for them.” Everyone there is using Seznam as their default service. So I thought, “Okay, the Czech people built their own search engine. They put some effort into customizing and localizing it. It has a remarkable and sustainable impact on its users.”
Shadi Saleh: After about one year, I was approached by Google about a software engineering position. For this, I would have had to give up pursuing my Ph.D. During that time I was also thinking about doing a project in Syria. I was not quite sure what to do, but what helped me was that I had some colleagues there who had been working with me; we had run some outsourcing projects for people in Europe.
Shadi Saleh: So, I had a team, and we were passionate about doing something interesting, and I said, “Maybe building a search engine would be an interesting idea since I am studying [information retrieval].” It started as a set of experiments. We didn’t expect to reach the stage where we are now. I didn’t even want to do it myself at first. Because I had some connection with Seznam, I tried to convince them to expand into Syria. I told them, “There are 500 million Arabic users, and we don’t have any form of an Arabic search engine, so let’s do it together. I can help you to enter the market, and you can help with your experience and platform.” But they told me, “We don’t want to get distracted. We want to focus on the Czech market.” Even in Slovakia, which is very close to Czechia in terms of language and culture, Seznam puts less effort in. It’s not easy for them to be the number one search engine in Czechia all the time. They told me it would not be difficult for me to try some open-source products and tools, and to experiment with them.
Jochen Leidner: That’s the spirit.
Shadi Saleh: At that time, I was talking to a researcher who was working in my department, and who encouraged me to start doing this without any formal support from Seznam. I told my team that I was not saying that we were going to build a search engine, but that I wanted to start some experiments. Let’s see what we can do and where it might go. At that time, we did some market research about what people needed in Syria, what the obstacles were, what they expected from a local search engine, and what interested them most, and we started with a four-month project. After a very short period we realized that we could do something special.
Jochen Leidner: How did you do the market research? Did you interview random people in the street? Did you go to a company? It’s quite interesting because a lot of start-ups, they just jump right in on the technical side, and in the business plan literature they always recommend you write a formal business plan. It sounds like you were doing this somewhat professionally by doing market research, but how did you go about this in practice?
Shadi Saleh: Actually, it was not easy to perform market research there, because Syria is a market where it’s very hard to find information online. I relied mostly on two things: first, the top-ranked Web-sites in Syria from Alexa [alexa.com, ed.], which gave me a lot of nice information. I’m very grateful to them. For example, when you go to Alexa, you can sort Web-sites by country. When you visit each profile on Alexa, you see the relevant keywords that people usually use to get there.
Shadi Saleh: We looked at the first 100 websites in Syria, and we came up with some conclusions, like okay, people are interested in this and that. They want to read the news. They are interested in shopping, especially for these products. They like to search for movies or songs, academic content, jobs. Job-listing services were a really strong indicator for us. Second, I bought some data from strong-performing sites in Syria in the form of Web statistics, including statistics on the devices that are used in Syria. We converted this information into action. We noticed, for example, that among 100,000 users a day, a mere 2% were using the iPhone. So, we decided not to distract ourselves by building an iPhone application. It would cost a lot; instead, we would build an Android application, and we would take great care to build a responsive website, because people there seemed to use browsers on tablets mostly when surfing the Web.
Jochen Leidner: Your engine is called “Shamra”. What does it actually mean, and how did you come up with the name?
Shadi Saleh: Well, frankly it was very hard to find a name for the project. First, we tried Syndex, as in “the Syria Index”. To this day, there are many libraries in our code named “Syndex”. We stuck with this name for four months during development, before we went online. But we didn’t use it publicly. I knew that naming the search engine “Syndex” would be the first mistake in our strategy, as we wanted to localize it and give our users the feeling that we’re close to them.
Shadi Saleh: I was always amazed by the discovery in the 1920s of one of the first alphabets in Ras Shamra. It’s a coastal village close to my city, Latakia. So, I chose the name “Shamra”. We wanted to send a message that “We will improve our existence in the digital era after 8,000 years when our ancestors created the first alphabet of the world.” I was very lucky with that name. We created a Facebook page called “Shamra”, without any content on it. We just wanted to reserve the name. In the first two or three days we got a lot of “likes”. More than 5,000 people liked the page because they liked the name, and they felt that it was close to them. Then I knew that this was the right name.
Shadi Saleh: When I asked a designer to propose a meaningful, relevant logo to go with it, he offered me a very nice logo: if you look at Shamra in Arabic, you will notice that there’s a key on the right side (the letter S in Arabic), and there’s a magnifying glass in the middle, which means that Shamra is the key to the Arabic digital content in Syria and magnifies – focuses – on searching Syrian content.
Jochen Leidner: As a computational linguist, I like the name. Ras Shamra, for those readers who don’t know it, is the site of the ancient port city of Ugarit, in northern Syria, where in 1928 a set of Ugaritic texts was discovered by a French archaeologist.
Shadi Saleh: Exactly.
Jochen Leidner: And that’s what you named it, a very inspiring name.
Shadi Saleh: Yes. When I first heard about Shamra, I was perhaps seven years old; we went there on a school trip. I was amazed by it.
Jochen Leidner: Nice linguistic connection indeed. How long have you been running it now?
Shadi Saleh: Shamra went online officially on September 6, 2015, so it has been online for about three years and ten months. We have around 21,000 registered users. They created accounts, they use the mobile applications, and they visit Shamra regularly. And we have around 150,000 unique visitors every month. The figures change significantly according to what’s going on in the country. For example, during periods of big breaking news or a fluctuation in the exchange rates, it reaches half a million users.
Jochen Leidner: Many readers of this will be rather technical, they may be IR [Information Retrieval] experts. What did you build first? Can you speak about the first build of your project? Did you first create an infrastructure for large scale indexing and retrieval, or did you just create a very small local index and just put it up as a Web server?
Shadi Saleh: Yeah, actually the first thing we started with was some research: how many servers we wanted, and how to make things scalable. Like, if today we design infrastructure for 100,000 users, what if tomorrow we want to expand it? It should be scalable on-the-fly. We didn’t want to go offline and we didn’t want to have to change the infrastructure. So, we put a lot of effort into that. First, we created a cluster for our Web server, a file server cluster, and a cluster of database servers, with load balancers for each, and then we created clusters for Hadoop, for crawlers, and a cluster for Elasticsearch with multiple nodes. This is the first time I mention Elasticsearch, which is really amazing. It’s very fast, and the documentation online is great. We made some contributions to the open-source code base, and we are using Nutch as our crawler.
Shadi Saleh: We implemented a lot of libraries for Arabic NLP. We conducted some research on the existing tools and decided not to apply them, since we were not happy with the results. You know, some NLP tools can be very good, but when you apply them in information retrieval, performance degrades. We started with a very small, let me call it, “data center” [laughs] to do crawling, indexing and experimental searches, to see if we had the relevant documents in the index to cater for a set of diverse queries from different domains that we had prepared beforehand. We realized we were happy with the speed of the retrieval system and the precision of retrieval, so we started implementing more of the IR infrastructure, and indexing a huge number of documents and websites.
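[Editorial sketch, not part of the conversation: one common way to check "the precision of retrieval" against a prepared set of test queries is precision at k, i.e. the fraction of the top k results that were judged relevant. The function name, document IDs and relevance judgments below are hypothetical illustrations; the interview does not say which measure or tooling the Shamra team actually used.]

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved documents that are judged relevant.

    ranked_ids:   document IDs in the order the engine returned them
    relevant_ids: set of document IDs judged relevant for the query
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k (not by len(top_k)) so that returning fewer than
    # k results is penalized, as in the standard definition.
    return hits / k


# Hypothetical test query: the engine returned four documents,
# two of which appear in the hand-made relevance judgments.
ranking = ["d1", "d7", "d3", "d9"]
judged_relevant = {"d1", "d3", "d4"}
score = precision_at_k(ranking, judged_relevant, k=4)
```

Averaging this score over a set of queries from different domains gives a single number that can be tracked as the index grows.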
Jochen Leidner: Great. That’s fascinating. You mentioned Arabic NLP, and I know there’s been a lot of research done, by people like Ken Beesley [now at SAP Labs, ed.] and many others. You mentioned you had to build some custom NLP. How, in your opinion, is the state of Arabic open-source NLP components that you were able to use directly versus having to build your own?
Shadi Saleh: There are some state-of-the-art methods for what we wanted to put in place, like stemming algorithms for retrieval, which reduce the search space. Unfortunately, the existing algorithms did not work well for IR. So, in the end we decided to use our own rule-based algorithm rather than state-of-the-art methods that might give the best accuracy.
As you know, the Arabic spoken in Egypt is different from that spoken in Syria. If you look at work from the research community on Arabic NLP, you will see that it came from Egypt or from Qatar or Saudi Arabia, so we needed to make adjustments. So, yeah, we noticed that we had to build our own, both for performance reasons and to accommodate the language spoken in Syria.
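[Editorial sketch, not part of the conversation: a rule-based stemmer for retrieval can be as simple as stripping a few frequent prefixes and suffixes so that different surface forms of a word map to the same index term. The affix lists below are assumptions chosen for illustration, in the spirit of published Arabic "light" stemmers; they are not Shamra's actual rules, which the interview does not describe.]

```python
# Illustrative affix lists (assumptions, not Shamra's rules).
# Longer prefixes come first so that "وال" is tried before "ال" or "و".
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(token: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix from an Arabic token.

    An affix is only removed if at least `min_len` characters remain,
    which protects short words from being reduced to nothing.
    """
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= min_len:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_len:
            token = token[:-len(suffix)]
            break
    return token


# "the book" and "book" end up as the same index term.
stems = [light_stem(word) for word in ["الكتاب", "كتاب"]]
```

Applying the same function to documents at indexing time and to queries at search time is what shrinks the search space; dialect-specific rules of the kind mentioned above would be added as further entries or special cases.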
Jochen Leidner: What type of search queries are particularly popular among Syrians?
Shadi Saleh: Well, first of all, it has been number one almost since we started: exchange rates! Exchange rates between the dollar and the Syrian pound. As you know, we have an economic crisis, and as a result of the situation there, exchange rates change almost every day. And since there are a lot of people outside Syria, they send money to their relatives, or they are interested in the exchange rate. We noticed that from the beginning. If you go to Google, for example, to try to find exchange rates, Google doesn’t have access to accurate information from our Central Bank, so Google started to show exchange rate results from our Web-site. Look at this photo: I searched Google for exchange rates in Syria, and I got a nice table fetched from Shamra, so basically, Google considers Shamra a reliable source of information for this query. This is not the case for other currencies, right? For example, if you search Google for exchange rates between USD and EUR, it will give you accurate data, so it does not make sense to implement such a service there because you would not bring any added value.
Shadi Saleh: So, what we did was design a service that gives you the exchange rate. It generates charts that show how the rate has changed over the period from 2015 until now. It’s really funny how people sometimes go to Google and write, “Shamra exchange rate”. So, there is a “brand” now, the “Shamra exchange rate”; people trust it, and they like it, so it has been the number one type of query. I’m talking here about two types of queries: queries people type into Google that bring them to us, and queries people type into our search engine directly.
Jochen Leidner: Right. So, I guess Google is delivering a lot of added traffic to you just because people hear from other people about Shamra. It’s nice that the marketing is working without a marketing budget. I guess that’s a testament to the quality of your work.
Shadi Saleh: Yes – about 15,000 users come from Google search every month. We use Google Console to monitor this.
Shadi Saleh: To your question: exchange rates, number one. Jobs: people are really interested in finding jobs in the area. How to get some government documents, e.g. how to renew my passport, how to issue some cards. And queries related to retail service, or how to obtain proof that I am a student. Also, some queries depend on the time of year, like TV series, or religious queries, especially during Ramadan. Queries about proxies and VPNs. We are crazy about proxies and VPNs because, as you know, Syria is a sanctioned country, so there are lots of services you cannot visit, or tools you cannot download as a developer, so people have to use proxies to access these sites abroad. We have a lot of queries about academic content that is hosted on our server. We had noticed that there was nothing like Google Scholar that focused on academic (Arabic) content, so we designed and implemented a service called Shamra Academia.
Shadi Saleh: Also, people are looking for places, restaurants, hospitals, clinics. We have something like Google Maps called Shamra Places. I think we have the best data in the world when it comes to Syria. Last month, as I remember, at least two doctors contacted us to update their information and their profiles on Places, because they had tried to look up their names on Google. They found that the first results were from Shamra Places. They wanted to make sure that the information on it was valid. It’s true that we have a system where you can suggest and edit your information yourself, but some people find it easier to contact us by e-mail.
Jochen Leidner: In a country like Syria, which has sadly suffered from a lot of military conflict, what are the things that are missing or hard to get specifically in the area of IT? Were you impeded by some of this, or was it quite straightforward to build Shamra?
Shadi Saleh: This is a very interesting question, and the answers might be shocking for some. In Syria, the IT infrastructure is far behind. For example, we don’t have any e-payment system, which means that we are completely excluded in terms of Internet business models. All the business models that rely on e-payment systems, like Web-based shopping, just cannot be used in Syria. Syria is a sanctioned country, which means that companies like Facebook and Google don’t allow business partners to promote their content by using Google ads or Facebook Business. The same applies to software products. You cannot get an original copy of any operating system like [Microsoft] Windows. You cannot buy an SSL certificate for your Web-sites.
Shadi Saleh: I would like to thank Let’s Encrypt. Since our domain is registered in Syria, I tried many times to buy an SSL certificate but I couldn’t, so we are now using Let’s Encrypt, which provides free SSL certificates.
Jochen Leidner: This is quite important. To you, these things are mission-critical. So the sanctions, they hit very hard those people who are trying to do good things and to try to get Syria back on track and help the country, and entrepreneurs like you and your team.
Shadi Saleh: Also, when it comes to servers and hardware, you will see in the terms of service that “This product cannot be used in Syria,” or whatever it is, so that was actually the main obstacle we were trying to solve. We worked around it by using open-source software like Linux, of course, which can be used anywhere.
Jochen Leidner: How about network bandwidth? Is the country well connected to serve its own population with regular daily Web queries, or for larger scale?
Shadi Saleh: If we are talking about using the Internet to read the news or to run some search queries, I think it’s good, and things are getting better. It’s been one of the worst Internet services in the world according to some statistics. But we have noticed an improvement during the last two years. This year, 4G came to Syria and is available now. It’s really expensive, but still, some people can have high-speed access to the Internet. Also, recently we started to use fiber-optic cables, which is also improving the Internet. I would say it’s not perfect for video streaming or similar services that require a fast connection, but for surfing the Internet it’s enough.
Jochen Leidner: Over here in Europe there was an attempt to create a Web search engine as a public alternative to Google by a consortium. That has ultimately failed. What are the difficulties pulling off such a search project, and why could you be successful after all? What were the really hard parts?
Shadi Saleh: It’s hard for me to say. I’m not aware of such attempts. I know of some; for example, there was a French search engine. I forgot its name; “a search engine that protects your privacy” was their slogan, but I didn’t take a closer look at it. But I can tell you some small features that a search engine should have to bring added value to its users.
Jochen Leidner: The French search company Qwant, probably?
Shadi Saleh: Yeah, Qwant, exactly. Yeah. They were sponsors of CLEF [the Cross Language Evaluation Forum, ed.] recently, 2018 if I’m not mistaken? I’m not actually aware of the reasons behind that. But let me tell you, as I mentioned, some points that will make you fail if you ignore them. First, focusing on Syrian content, and customizing the crawlers to reach a limited, narrow set of Web-sites. It helped us a lot to customize the results. When you are trying to develop a search engine for a huge market with multiple user interests and multiple languages, it will be really challenging, so you have to focus on a small market. Also, we realized at the very beginning that hosting the content, or let me say designing a platform for content creation, would help us to optimize the search much better. You know in information retrieval that it’s much easier to look up information that you own and structure in your preferred way than to deal with raw [unstructured, ed.] data that is hosted on different websites, where each website follows a different Web schema and, in most cases, not a standard one.
Shadi Saleh: To remedy this, we created content platforms. For example, for jobs, we have Shamra Jobs. Companies can post jobs, and users can create their profiles. So, the platform allows them to match or find jobs more easily than by just crawling some websites which we don’t have control over. For academic content, there’s Shamra Academia to host scientific publications, as I explained. We also created a platform for e-commerce or product listings. And then there is news, exchange rates, the weather, places. Our ultimate goal is to find relevant information that is created, verified and hosted on our platform.
Jochen Leidner: Fantastic, Shadi. Thank you so much for your time, and for sharing your exciting journey. All the best for Shamra, for you and your team, and for the writing-up of your Ph.D. thesis on cross-lingual information retrieval. Thanks a lot again for taking the time to be interviewed, and have a great rest of the weekend.
Shadi Saleh: Thank you so much. I am very grateful for the interview and the interesting questions – I liked them. And thanks again for giving me the opportunity to talk about my project.
Published under a Creative Commons License (CC-SA).